Maat: Performance Metric Anomaly Anticipation for Cloud Services with Conditional Diffusion
Ensuring the reliability and user satisfaction of cloud services necessitates
prompt anomaly detection followed by diagnosis.
Existing techniques for anomaly detection focus solely on real-time
detection, meaning that anomaly alerts are issued as soon as anomalies occur.
However, anomalies can propagate and escalate into failures, making
faster-than-real-time anomaly detection highly desirable for expediting
downstream analysis and intervention.
This paper proposes Maat, the first work to address anomaly anticipation of
performance metrics in cloud services.
Maat adopts a novel two-stage paradigm for anomaly anticipation, consisting
of metric forecasting and anomaly detection on forecasts.
The metric forecasting stage employs a conditional denoising diffusion model
to enable multi-step forecasting in an auto-regressive manner.
The detection stage extracts anomaly-indicating features based on domain
knowledge and applies isolation forest with incremental learning to detect
upcoming anomalies.
Thus, our method can uncover anomalies that better conform to human
expertise.
Evaluation on three publicly available datasets demonstrates that Maat can
anticipate anomalies faster than real time while performing comparably to or
more effectively than state-of-the-art real-time anomaly detectors.
We also present cases highlighting Maat's success in forecasting abnormal
metrics and discovering anomalies.
Comment: This paper has been accepted by the Research track of the 38th
IEEE/ACM International Conference on Automated Software Engineering (ASE 2023).
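The two-stage paradigm described above, forecasting first and then detecting on the forecasts, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the conditional diffusion model is replaced by a placeholder auto-regressive extrapolator, and the hand-crafted features are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def forecast_ar(history, steps=5):
    # Placeholder forecaster: linear extrapolation applied auto-regressively.
    # (Maat itself uses a conditional denoising diffusion model here.)
    h = list(history)
    preds = []
    for _ in range(steps):
        nxt = 2 * h[-1] - h[-2]  # each prediction is fed back as input
        preds.append(nxt)
        h.append(nxt)
    return np.array(preds)

def extract_features(window):
    # Anomaly-indicating summary features (illustrative, not the paper's set).
    return [window.mean(), window.std(), window.max() - window.min()]

# Stage 2 detector: isolation forest trained on features of normal windows.
rng = np.random.default_rng(0)
normal = [rng.normal(1.0, 0.05, 20) for _ in range(200)]
det = IsolationForest(random_state=0).fit([extract_features(w) for w in normal])

history = rng.normal(1.0, 0.05, 20)
future = forecast_ar(history)                        # stage 1: multi-step forecast
label = det.predict([extract_features(future)])[0]   # stage 2: detect on forecast
```

Because detection runs on forecasts rather than observations, an alert can precede the anomaly itself, which is the "faster than real time" property the abstract claims.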
Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention
Prompt and accurate detection of system anomalies is essential to ensure the
reliability of software systems. Unlike manual efforts that exploit all
available run-time information, existing approaches usually leverage only a
single type of monitoring data (often logs or metrics) or fail to make
effective use of the joint information among different types of data.
Consequently, many false predictions occur. To better understand the
manifestations of system anomalies, we conduct a systematic study on a large
amount of heterogeneous data, i.e., logs and metrics. Our study demonstrates
that logs and metrics can manifest system anomalies collaboratively and
complementarily, and that neither alone is sufficient. Thus, integrating
heterogeneous data can help recover the complete picture of a system's health
status. In this context, we propose Hades, the first end-to-end semi-supervised
approach to effectively identify system anomalies based on heterogeneous data.
Our approach employs a hierarchical architecture to learn a global
representation of the system status by fusing log semantics and metric
patterns. It captures discriminative features and meaningful interactions from
heterogeneous data via a cross-modal attention module, trained in a
semi-supervised manner. We evaluate Hades extensively on large-scale simulated
data and datasets from Huawei Cloud. The experimental results present the
effectiveness of our model in detecting system anomalies. We also release the
code and the annotated dataset for replication and future research.
Comment: In Proceedings of the 2023 IEEE/ACM 45th International Conference on
Software Engineering (ICSE). arXiv admin note: substantial text overlap with
arXiv:2207.0291
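The cross-modal attention idea at the core of Hades, letting one modality attend to the other so that joint information is exploited rather than a single data type, can be sketched as plain scaled dot-product attention between log and metric embeddings. All shapes and names here are hypothetical; the paper's module is a trained neural layer, not this NumPy sketch.

```python
import numpy as np

def cross_modal_attention(log_emb, metric_emb):
    """Log embeddings (queries) attend to metric embeddings (keys/values).
    log_emb: (n, d) log-event vectors; metric_emb: (m, d) metric-window vectors.
    Returns a (n, d) metric context for each log event."""
    d = log_emb.shape[-1]
    scores = log_emb @ metric_emb.T / np.sqrt(d)      # (n, m) similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
    return weights @ metric_emb                       # fused representation

rng = np.random.default_rng(0)
logs, metrics = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
fused = cross_modal_attention(logs, metrics)          # (4, 8)
```

The fused output is what a downstream classifier would consume to produce the global representation of system status the abstract describes.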
Performance Issue Identification in Cloud Systems with Relational-Temporal Anomaly Detection
Performance issues permeate large-scale cloud service systems, which can lead
to huge revenue losses. To ensure reliable performance, it is essential to
accurately identify and localize these issues using service monitoring metrics.
Given the complexity and scale of modern cloud systems, this task can be
challenging and may require extensive expertise and resources beyond the
capacity of individual humans. Some existing methods tackle this problem by
analyzing each metric independently to detect anomalies. However, this could
incur overwhelming alert storms that are difficult for engineers to diagnose
manually. To pursue better performance, not only the temporal patterns of
metrics but also the correlation between metrics (i.e., relational patterns)
should be considered, which can be formulated as a multivariate metrics anomaly
detection problem. However, most of the studies fall short of extracting these
two types of features explicitly. Moreover, there exist some unlabeled
anomalies mixed in the training data, which may hinder the detection
performance. To address these limitations, we propose the Relational-Temporal
Anomaly Detection Model (RTAnomaly) that combines the relational and temporal
information of metrics. RTAnomaly employs a graph attention layer to learn the
dependencies among metrics, which will further help pinpoint the anomalous
metrics that may cause the anomaly effectively. In addition, we exploit the
concept of positive unlabeled learning to address the issue of potential
anomalies in the training data. To evaluate our method, we conduct experiments
on a public dataset and two industrial datasets. RTAnomaly outperforms all the
baseline models, achieving an average F1 score of 0.929 and Hit@3 of 0.920,
demonstrating its superiority.
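The graph attention layer that RTAnomaly uses to learn dependencies among metrics can be sketched as a single GAT-style head: each metric node attends over its neighbors in a (learned or given) dependency graph. This is a hand-rolled NumPy illustration with hypothetical shapes, not the paper's trained layer.

```python
import numpy as np

def graph_attention(x, adj, w, a):
    """One single-head, GAT-style attention pass over metric nodes.
    x: (n, f) per-metric features; adj: (n, n) 0/1 adjacency (with self-loops);
    w: (f, h) projection; a: (2h,) attention vector."""
    h = x @ w                                          # project node features
    n, hd = h.shape
    # Attention logits e_ij = a . [h_i || h_j], LeakyReLU, masked by adjacency.
    e = np.array([[a[:hd] @ h[i] + a[hd:] @ h[j] for j in range(n)]
                  for i in range(n)])
    e = np.where(e > 0, e, 0.2 * e)                    # LeakyReLU
    e = np.where(adj > 0, e, -1e9)                     # only attend to neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)          # softmax over neighbors
    return alpha @ h                                   # aggregated features

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
adj = np.eye(5) + np.eye(5, k=1) + np.eye(5, k=-1)    # chain graph + self-loops
out = graph_attention(x, adj, w=rng.normal(size=(4, 8)), a=rng.normal(size=(16,)))
```

Inspecting the attention weights of such a layer is also what makes the anomalous-metric localization (Hit@3) interpretable: metrics that receive unusual attention mass are candidate culprits.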
A Large-scale Benchmark for Log Parsing
Log data is pivotal in activities like anomaly detection and failure
diagnosis in the automated maintenance of software systems. Due to their
unstructured format, log parsing is often required to transform them into a
structured format for automated analysis. A variety of log parsers exist,
making it vital to benchmark these tools to comprehend their features and
performance. However, existing datasets for log parsing are limited in terms of
scale and representativeness, posing challenges for studies that aim to
evaluate or develop log parsers. This problem becomes more pronounced when
these parsers are evaluated for production use. To address these issues, we
introduce a new collection of large-scale annotated log datasets, named LogPub,
which more accurately mirrors log data observed in real-world software systems.
LogPub comprises 14 datasets, each averaging 3.6 million log lines. Utilizing
LogPub, we re-evaluate 15 log parsers in a more rigorous and practical setting.
We also propose a new evaluation metric to lessen the sensitivity of current
metrics to imbalanced data distribution. Furthermore, we are the first to
scrutinize the detailed performance of log parsers on logs that represent rare
system events and offer comprehensive information for system troubleshooting.
Parsing such logs accurately is vital yet challenging. We believe that our work
could shed light on the design and evaluation of log parsers in more realistic
settings, thereby facilitating their implementation in production systems.
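The core task that LogPub benchmarks, recovering a structured template from an unstructured log line, can be illustrated with a naive regex-based parser. Real parsers evaluated in such benchmarks (e.g. Drain) build parse trees and handle far more variable types; this sketch only shows the idea of masking variable tokens.

```python
import re

def parse_log(line):
    """Naive log parser: mask variable tokens to recover a template.
    The regexes below are illustrative; production parsers are far richer."""
    template = re.sub(r'0x[0-9a-fA-F]+', '<*>', line)        # hex addresses
    template = re.sub(r'\b\d+(\.\d+)*\b', '<*>', template)   # ints, IPs, versions
    return template

print(parse_log("Connection from 10.0.0.5 port 8080"))
# -> Connection from <*> port <*>
```

Lines that represent rare system events, which the abstract singles out as vital yet challenging, are exactly where such frequency-agnostic template extraction tends to break down, motivating the proposed evaluation metric for imbalanced data.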
Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems
Ensuring the reliability of cloud systems is critical for both cloud vendors
and customers. Cloud systems often rely on virtualization techniques to create
instances of hardware resources, such as virtual machines. However,
virtualization hinders the observability of cloud systems, making it
challenging to diagnose platform-level issues. To improve system observability,
we propose to infer functional clusters of instances, i.e., groups of instances
having similar functionalities. We first conduct a pilot study on a large-scale
cloud system, i.e., Huawei Cloud, demonstrating that instances having similar
functionalities share similar communication and resource usage patterns.
Motivated by these findings, we formulate the identification of functional
clusters as a clustering problem and propose a non-intrusive solution called
Prism. Prism adopts a coarse-to-fine clustering strategy. It first partitions
instances into coarse-grained chunks based on communication patterns. Within
each chunk, Prism further groups instances with similar resource usage patterns
to produce fine-grained functional clusters. Such a design reduces noises in
the data and allows Prism to process massive instances efficiently. We evaluate
Prism on two datasets collected from the real-world production environment of
Huawei Cloud. Our experiments show that Prism achieves a v-measure of ~0.95,
surpassing existing state-of-the-art solutions. Additionally, we illustrate the
integration of Prism within monitoring systems for enhanced cloud reliability
through two real-world use cases.
Comment: The paper was accepted by the 38th IEEE/ACM International Conference
on Automated Software Engineering (ASE 2023).
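Prism's coarse-to-fine strategy can be sketched in two lines of logic: chunk instances by communication-graph connectivity, then cluster each chunk by resource-usage patterns. The choice of connected components and k-means below is a simplification for illustration; the paper's actual algorithms may differ.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import KMeans

def coarse_to_fine(comm_adj, usage, k=2):
    """Coarse stage: partition instances into chunks via the communication graph.
    Fine stage: within each chunk, group instances by resource-usage patterns.
    Returns a (chunk_id, cluster_id) label per instance."""
    n_chunks, chunk_ids = connected_components(csr_matrix(comm_adj), directed=False)
    labels = np.empty(len(usage), dtype=object)
    for c in range(n_chunks):
        idx = np.where(chunk_ids == c)[0]
        km = KMeans(n_clusters=min(k, len(idx)), n_init=10, random_state=0)
        for i, fine in zip(idx, km.fit_predict(usage[idx])):
            labels[i] = (c, fine)
    return labels

# Two communication components; usage patterns distinguish instances within them.
comm = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])
usage = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 0.9]])
labels = coarse_to_fine(comm, usage, k=1)
```

Restricting the fine clustering to within-chunk instances is what keeps the approach tractable at the scale of "massive instances": each k-means run sees only one chunk's worth of data.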
Corrigendum to: The TianQin project: current progress on science and technology
In the originally published version, this manuscript included an error in the indication of the corresponding author within the author list. This has now been corrected online to reflect the fact that author Jun Luo is the corresponding author of the article.
Reliability Improved Cooperative Communication over Wireless Sensor Networks
With the development of smart devices and connection technologies, Wireless Sensor Networks (WSNs) are becoming increasingly intelligent. New or special functions can be obtained by receiving new versions of program code to upgrade their software systems, forming the so-called smart Internet of Things (IoT). Due to the lossy nature of wireless channels, data collection in WSNs still suffers from long delays, high energy consumption, and frequent retransmissions. Thanks to wireless software-defined networks (WSDNs), software in sensors can now be updated to help them transmit data cooperatively, thereby achieving more reliable communication. In this paper, a Reliability Improved Cooperative Communication (RICC) data collection scheme is proposed to improve the reliability of random-network-coding-based cooperative communication in multi-hop relay WSNs without reducing the network lifetime. In WSNs, sensors in different positions can have different numbers of packets to handle, resulting in unbalanced energy consumption across the network. In particular, nodes in non-hotspot areas have up to 90% of their original energy remaining when the network dies. To efficiently use this residual energy, RICC adopts high data transmission power in non-hotspot areas to achieve higher reliability at the cost of greater energy consumption, and relatively low transmission power in hotspot areas to maintain a long network lifetime. Therefore, high reliability and a long network lifetime can be obtained simultaneously. The simulation results show that, compared with other schemes, RICC can reduce the end-to-end Message Fail delivering Ratio (MFR) by 59.4%–62.8% under the same lifetime, with more balanced energy utilization.
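The power-assignment rule at the heart of RICC, spending the surplus energy of non-hotspot nodes on higher transmit power while hotspot nodes (near the sink, relaying many packets) keep low power to preserve lifetime, can be sketched as a simple threshold on distance from the sink. The radius and power values here are entirely hypothetical placeholders.

```python
def assign_tx_power(hop_distance, hotspot_radius=3, p_low=0.5, p_high=1.0):
    """RICC-style power assignment (illustrative constants, not the paper's).
    Nodes within the hotspot region near the sink transmit at low power to
    preserve network lifetime; farther nodes, which die with most of their
    energy unused, transmit at high power for reliability."""
    return p_low if hop_distance <= hotspot_radius else p_high

powers = [assign_tx_power(h) for h in range(1, 7)]
```

The scheme trades otherwise-wasted residual energy for reliability, which is why both high reliability and a long lifetime can hold simultaneously.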
Heterogeneous Anomaly Detection for Software Systems via Attentive Multi-modal Learning
Prompt and accurate detection of system anomalies is essential to ensure the
reliability of software systems. Unlike manual efforts that exploit all
available run-time information, existing approaches usually leverage only a
single type of monitoring data (often logs or metrics) or fail to make
effective use of the joint information among multi-source data. Consequently,
many false predictions occur. To better understand the manifestations of system
anomalies, we conduct a comprehensive empirical study based on a large amount
of heterogeneous data, i.e., logs and metrics. Our study demonstrates that
system anomalies could manifest distinctly in different data types. Thus,
integrating heterogeneous data can help recover the complete picture of a
system's health status. In this context, we propose HADES, the first work to
effectively identify system anomalies based on heterogeneous data. Our approach
employs a hierarchical architecture to learn a global representation of the
system status by fusing log semantics and metric patterns. It captures
discriminative features and meaningful interactions from multi-modal data via a
novel cross-modal attention module, enabling accurate system anomaly detection.
We evaluate HADES extensively on large-scale simulated and industrial datasets.
The experimental results present the superiority of HADES in detecting system
anomalies on heterogeneous data. We release the code and the annotated dataset
for reproducibility and future research.